{ "cells": [ { "cell_type": "markdown", "id": "55f95cff", "metadata": {}, "source": [ "# Input Update Benchmark Notebook\n", "This notebook runs the input update benchmark for ParquetDB\n", "across various input data types and row counts, then plots the results.\n", "\n", "\n", "---\n", "\n", "## Benchmark Details\n", "\n", "- **Data Generation:** \n", " - 1,000,000 rows × 100 columns of integers (0–1,000,000). \n", " - Integers chosen as a basic primitive type—benchmark times here represent a **lower bound** on update performance; more complex or larger types will incur higher costs (due to larger byte sizes).\n", "\n", "- **Parquet Normalization Settings (defaults):** \n", " - **Row groups:** minimum 50,000–100,000 rows per group \n", " - **File size cap:** maximum 10,000,000 rows per file \n", " - You can tune these settings (e.g. smaller row groups or larger file limits) to trade off update vs. read performance.\n", "\n", "---\n", "\n", "## System Specifications\n", "\n", "- **Operating System:** Windows 10 \n", "- **Processor:** AMD Ryzen 7 3700X 8‑Core @ 3.6 MHz (8 cores, 16 logical processors) \n", "- **Memory:** 128 GB DDR4‑3600 MHz (4×32 GB DIMMs) \n", "- **Storage**: SATA HDD 2TB (Model: ST2000DM008-2FR102)\n", "\n", "---" ] }, { "cell_type": "markdown", "id": "d15a6c16", "metadata": {}, "source": [ "## 1. Setup\n", "Import required libraries and set up paths." ] }, { "cell_type": "code", "execution_count": null, "id": "4679e232", "metadata": {}, "outputs": [], "source": [ "!pip install parquetdb" ] }, { "cell_type": "code", "execution_count": 1, "id": "68c6d745", "metadata": {}, "outputs": [], "source": [ "import os\n", "import time\n", "import shutil\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import matplotlib.ticker as ticker\n", "from mpl_toolkits.axes_grid1.inset_locator import inset_axes\n", "import pyarrow as pa\n", "from parquetdb import ParquetDB, config\n", "from parquetdb.utils import general_utils\n", "\n", "# Set up data directory\n", "bench_dir= os.path.join(config.data_dir, 'benchmarks')\n", "save_dir = os.path.join(bench_dir, 'parquetdb')\n", "os.makedirs(save_dir, exist_ok=True)\n", "db_path = os.path.join(save_dir, 'BenchmarkDB')\n" ] }, { "cell_type": "markdown", "id": "1b6f56de", "metadata": {}, "source": [ "## 2. Initialize Database\n", "Remove any existing database and create a fresh one with 1M rows of random data." ] }, { "cell_type": "code", "execution_count": 2, "id": "1a832dd7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[INFO] 2025-04-19 11:26:40 - parquetdb.core.parquetdb[205][__init__] - Initializing ParquetDB with db_path: Z:\\data\\parquetdb\\data\\benchmarks\\parquetdb\\BenchmarkDB\n", "[INFO] 2025-04-19 11:26:40 - parquetdb.core.parquetdb[207][__init__] - verbose: 1\n" ] } ], "source": [ "# Remove existing DB\n", "if os.path.exists(db_path):\n", " shutil.rmtree(db_path)\n", "\n", "# Initialize and populate\n", "db = ParquetDB(db_path=db_path)\n", "data = general_utils.generate_pydict_data(\n", " n_rows=1_000_000, n_columns=100, min_value=0, max_value=1_000_000\n", ")\n", "db.create(data)\n", "del data, db\n" ] }, { "cell_type": "markdown", "id": "2caa07e6", "metadata": {}, "source": [ "## 3. Run Input Update Benchmark\n", "Benchmark update performance for different input types and row counts." ] }, { "cell_type": "code", "execution_count": 3, "id": "fb1e397b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Input=pydict, rows=1, time=9.9136s\n", "Input=pylist, rows=1, time=9.4008s\n", "Input=pandas, rows=1, time=9.5054s\n", "Input=table, rows=1, time=9.4650s\n", "Input=pydict, rows=10, time=9.5665s\n", "Input=pylist, rows=10, time=9.4898s\n", "Input=pandas, rows=10, time=9.8546s\n", "Input=table, rows=10, time=9.0859s\n", "Input=pydict, rows=100, time=9.4329s\n", "Input=pylist, rows=100, time=10.9167s\n", "Input=pandas, rows=100, time=9.4779s\n", "Input=table, rows=100, time=9.9870s\n", "Input=pydict, rows=1000, time=9.5006s\n", "Input=pylist, rows=1000, time=9.0437s\n", "Input=pandas, rows=1000, time=8.5357s\n", "Input=table, rows=1000, time=9.2891s\n", "Input=pydict, rows=10000, time=8.3776s\n", "Input=pylist, rows=10000, time=9.3234s\n", "Input=pandas, rows=10000, time=9.1188s\n", "Input=table, rows=10000, time=9.2263s\n", "Input=pydict, rows=100000, time=9.9234s\n", "Input=pylist, rows=100000, time=11.9785s\n", "Input=pandas, rows=100000, time=9.4006s\n", "Input=table, rows=100000, time=9.3413s\n", "Input=pydict, rows=1000000, time=14.9886s\n", "Input=pylist, rows=1000000, time=32.1442s\n", "Input=pandas, rows=1000000, time=9.5906s\n", "Input=table, rows=1000000, time=10.9055s\n", "Benchmark results saved to Z:\\data\\parquetdb\\data\\benchmarks\\parquetdb\\parquetdb_input_update_benchmark.csv\n" ] } ], "source": [ "# Benchmark parameters\n", "row_counts = [1, 10, 100, 1_000, 10_000, 100_000, 1_000_000]\n", "input_types = ['pydict', 'pylist', 'pandas', 'table']\n", "\n", "# Storage for results\n", "results = {\n", " 'input_data_type': [],\n", " 'n_rows': [],\n", " 'update_times': []\n", "}\n", "\n", "for n in row_counts:\n", " # generate update table\n", " table = general_utils.generate_table_update_data(\n", " n_rows=n, n_columns=100\n", " )\n", " for itype in input_types:\n", " db = ParquetDB(db_path=db_path)\n", " # prepare input\n", " if itype == 'pydict':\n", " update_data = table.to_pydict()\n", " elif itype == 'pylist':\n", " update_data = table.to_pylist()\n", " elif itype == 'pandas':\n", " update_data = table.to_pandas()\n", " else:\n", " update_data = table\n", "\n", " # run update\n", " start = time.perf_counter()\n", " db.update(update_data)\n", " elapsed = time.perf_counter() - start\n", " # record\n", " results['input_data_type'].append(itype)\n", " results['n_rows'].append(n)\n", " results['update_times'].append(elapsed)\n", "\n", " print(f'Input={itype}, rows={n}, time={elapsed:.4f}s')\n", " del update_data, db\n", "\n", "# save results\n", "df = pd.DataFrame(results)\n", "csv_path = os.path.join(save_dir, 'parquetdb_input_update_benchmark.csv')\n", "df.to_csv(csv_path, index=False)\n", "print('Benchmark results saved to', csv_path)" ] }, { "cell_type": "markdown", "id": "08db830d", "metadata": {}, "source": [ "## 4. Load and Preview Results" ] }, { "cell_type": "code", "execution_count": 3, "id": "e438cb46", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| \n", " | input_data_type | \n", "n_rows | \n", "update_times | \n", "
|---|---|---|---|
| 0 | \n", "pydict | \n", "1 | \n", "9.913610 | \n", "
| 1 | \n", "pylist | \n", "1 | \n", "9.400836 | \n", "
| 2 | \n", "pandas | \n", "1 | \n", "9.505378 | \n", "
| 3 | \n", "table | \n", "1 | \n", "9.465031 | \n", "
| 4 | \n", "pydict | \n", "10 | \n", "9.566514 | \n", "